19 research outputs found

    New decoding algorithms for Hidden Markov Models using distance measures on labellings

    BACKGROUND: Existing hidden Markov model decoding algorithms do not focus on approximately identifying the sequence feature boundaries. RESULTS: We give a set of algorithms to compute the conditional probability of all labellings "near" a reference labelling λ for a sequence y, for a variety of definitions of "near". In addition, we give optimization algorithms to find the best labelling for a sequence in the robust sense of having all of its feature boundaries nearly correct. Natural problems in this domain are NP-hard to optimize. For membrane proteins, our algorithms find the approximate topology of such proteins with comparable success to existing programs, while being substantially more accurate in estimating the positions of transmembrane helix boundaries. CONCLUSION: More robust HMM decoding may allow for better analysis of sequence features, in reasonable runtimes.

    On the Treewidth of Dynamic Graphs

    Dynamic graph theory is a novel, growing area that deals with graphs that change over time and is of great utility in modelling modern wireless, mobile and dynamic environments. As a graph evolves, possibly arbitrarily, it is challenging to identify the graph properties that can be preserved over time and understand their respective computability. In this paper we are concerned with the treewidth of dynamic graphs. We focus on metatheorems, which allow the generation of a series of results based on general properties of classes of structures. In graph theory two major metatheorems on treewidth provide complexity classifications by employing structural graph measures and finite model theory. Courcelle's Theorem gives a general tractability result for problems expressible in monadic second order logic on graphs of bounded treewidth, and Frick & Grohe demonstrate a similar result for first order logic and graphs of bounded local treewidth. We extend these theorems by showing that dynamic graphs of bounded (local) treewidth where the length of time over which the graph evolves and is observed is finite and bounded can be modelled in such a way that the (local) treewidth of the underlying graph is maintained. We show the application of these results to problems in dynamic graph theory and dynamic extensions to static problems. In addition we demonstrate that certain widely used dynamic graph classes naturally have bounded local treewidth

    Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

    BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms

    Optimizing Multiple Spaced Seeds for Homology Search

    Abstract. Optimized spaced seeds improve sensitivity and specificity in local homology search [1]. Recently, several authors [2-4] have shown that multiple seeds can have better sensitivity and specificity than single seeds. We describe a linear programming-based algorithm to optimize a set of seeds. Our algorithm offers a performance guarantee: the sensitivity of a chosen seed set is at least 70% of what can be achieved, in most reasonable models of homologous sequences. Our method achieves performance comparable to that of a greedy algorithm, but our work gives this area a mathematical foundation.

    The Most Probable Annotation Problem in HMMs and Its Application to Bioinformatics

    Hidden Markov models (HMMs) are often used for biological sequence annotation. Each sequence feature is represented by a collection of states with the same label. In annotating a new sequence, we seek the sequence of labels that has highest probability. Computing this most probable annotation was shown NP-hard by Lyngsø and Pedersen [15]. We improve their result by showing that the problem is NP-hard for a specific HMM, and present efficient algorithms to compute the most probable annotation for a large class of HMMs, including abstractions of models previously used for transmembrane protein topology prediction and coding region detection. We also present a small experiment showing that the maximum probability annotation is more accurate than the labeling that results from simpler heuristics.
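
    The distinction between the most probable state path and the most probable annotation can be illustrated on a toy model. The sketch below is not the paper's algorithm: it enumerates all paths and labellings by brute force, which is only feasible for very short sequences. The HMM itself (states `A`, `B1`, `B2`, their labels, and all probabilities) is a hypothetical example chosen so that the two answers differ.

```python
from itertools import product

# Hypothetical toy HMM: states "B1" and "B2" share the label "b".
states = ["A", "B1", "B2"]
label_of = {"A": "a", "B1": "b", "B2": "b"}
start = {"A": 0.4, "B1": 0.3, "B2": 0.3}
trans = {
    "A":  {"A": 0.4, "B1": 0.3, "B2": 0.3},
    "B1": {"A": 0.2, "B1": 0.4, "B2": 0.4},
    "B2": {"A": 0.2, "B1": 0.4, "B2": 0.4},
}
emit = {s: {"x": 0.5, "y": 0.5} for s in states}


def path_probability(obs, path):
    """Joint probability of the observations and one specific state path."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
    return p


def labelling_probability(obs, labels):
    """Joint probability of the observations and a labelling, summed over
    every state path consistent with it (forward algorithm with states of
    the wrong label masked out at each position)."""
    fwd = {
        s: (start[s] * emit[s][obs[0]]) if label_of[s] == labels[0] else 0.0
        for s in states
    }
    for t in range(1, len(obs)):
        fwd = {
            s: (sum(fwd[r] * trans[r][s] for r in states) * emit[s][obs[t]])
            if label_of[s] == labels[t] else 0.0
            for s in states
        }
    return sum(fwd.values())


obs = "xy"
# Brute-force stand-ins for Viterbi decoding and most-probable-annotation decoding.
best_path = max(product(states, repeat=len(obs)),
                key=lambda p: path_probability(obs, p))
best_labelling = max(product("ab", repeat=len(obs)),
                     key=lambda l: labelling_probability(obs, l))

print("labels of most probable path:", [label_of[s] for s in best_path])
print("most probable labelling:     ", list(best_labelling))
```

    In this toy model the single best path stays in `A`, while the probability mass spread over the several `b`-labelled paths makes a different labelling more probable overall.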

    Amino Acid Classification and Hash Seeds for Homology Search

    Spaced seeds have been extensively studied in the homology search field. A spaced seed can be regarded as a very special type of hash function on k-mers, where two k-mers have the same hash value if and only if they are identical at the w (w < k) positions designated by the seed. Spaced seeds substantially increased the homology search sensitivity. It is then a natural question to ask whether there is a better hash function (called hash seed) that provides better sensitivity than the spaced seed. We study this question in the paper. We propose a strategy to classify amino acids, which leads to a better hash seed. Our results raise a new question about how to design the best hash seed.
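
    The view of a spaced seed as a hash function on k-mers can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name `spaced_seed_hash`, the 0/1 string encoding of the seed, and the example k-mers are all assumptions made for the example.

```python
def spaced_seed_hash(kmer: str, seed: str) -> str:
    """Hash a k-mer by keeping only the characters at the seed's '1' positions.

    `seed` is a string over {'0', '1'} of length k; its number of '1's is the
    weight w. Two k-mers collide exactly when they agree at those w positions.
    """
    if len(kmer) != len(seed):
        raise ValueError("k-mer and seed must have the same length")
    return "".join(c for c, s in zip(kmer, seed) if s == "1")


seed = "1101"  # hypothetical seed with k = 4 and w = 3

# These two k-mers differ only at the position the seed ignores, so they collide.
print(spaced_seed_hash("MKVL", seed) == spaced_seed_hash("MKIL", seed))
# These differ at a position the seed checks, so their hash values differ.
print(spaced_seed_hash("MKVL", seed) == spaced_seed_hash("MKVA", seed))
```

    A hash seed in the sense of the abstract would replace this position-masking function with a more general one; the abstract does not give its construction, so none is sketched here.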

    Quality of Algorithms for Sequence Comparison

    Pairwise sequence alignment is the basic method of comparative analysis of proteins and nucleic acids. In studying the results of an alignment, one has to consider two questions: (1) did the program find all the interesting similarities (“sensitivity”), and (2) are all the similarities found interesting (“selectivity”)? One must, of course, specify which alignments are considered interesting. Analogous questions can be asked of each alignment obtained: (3) what fraction of the aligned positions are aligned correctly (“confidence”), and (4) does the alignment contain all pairs of corresponding positions of the compared sequences (“accuracy”)? Naturally, the answers to these questions depend on the definition of a correct alignment. The presentation addresses these two pairs of questions, which are extremely important in interpreting the results of sequence comparison.
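
    One way to make questions (3) and (4) concrete is to treat an alignment as a set of aligned position pairs and compare it with a reference alignment. The sketch below assumes that representation and assumes the two measures are simple overlap fractions; the abstract does not give formal definitions, and the function name and example pairs are hypothetical.

```python
def overlap_fractions(found: set, reference: set) -> tuple:
    """Return (fraction of `found` that is in `reference`,
               fraction of `reference` that is in `found`)."""
    common = len(found & reference)
    in_found = common / len(found) if found else 0.0
    in_reference = common / len(reference) if reference else 0.0
    return in_found, in_reference


# Hypothetical alignments as sets of (position in sequence 1, position in sequence 2).
predicted = {(1, 1), (2, 2), (3, 4), (4, 5)}
reference = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}

# "Confidence": share of predicted pairs that are correct.
# "Accuracy": share of reference pairs that were recovered.
confidence, accuracy = overlap_fractions(predicted, reference)
print(confidence, accuracy)
```

    Under the same assumption, the function also covers questions (1) and (2) if the sets hold whole similarities rather than position pairs: the second fraction then plays the role of sensitivity and the first that of selectivity.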